ABSTRACT

In the standard Huffman coding problem, one is given a set 
of words and for each word a positive frequency. The goal 
is to encode each word w as a codeword c(w) over a given 
alphabet. The encoding must be prefix free (no codeword 
is a prefix of any other) and should minimize the weighted 
average codeword size $\sum_w \mathrm{freq}(w)\,|c(w)|$. The problem has a
well-known polynomial-time algorithm due to Huffman [15]. 

Here we consider the generalization in which the letters of 
the encoding alphabet may have non-uniform lengths. The 
goal is to minimize the weighted average codeword length 
$\sum_w \mathrm{freq}(w)\,\mathrm{cost}(c(w))$, where cost(s) is the sum of the
(possibly non-uniform) lengths of the letters in s. Despite much
previous work, the problem is not known to be NP-hard, 
nor was it previously known to have a polynomial-time ap- 
proximation algorithm. Here we describe a polynomial-time 
approximation scheme (PTAS) for the problem. 

1. INTRODUCTION 

Given a set W of n words with associated probabilities
or frequencies $p_1 \ge p_2 \ge \cdots \ge p_n > 0$ and an encoding
alphabet $\Sigma$, the prefix coding problem, sometimes known as
the Huffman encoding problem, is to find a prefix-free code
over $\Sigma$ of minimum cost. This problem is very well studied
and has a well-known $O(n)$-time greedy algorithm due to
Huffman [15] ($O(n \log n)$ if the $p_i$ are not sorted in advance).

Here we consider the generalization of the problem in which 
the letters used for encoding can have different costs. That 
is, letting $r = |\Sigma|$, the r letters have associated costs
$\ell_1 \le \ell_2 \le \cdots \le \ell_r$ and the cost of a codeword is defined to be the
sum of the costs of its letters (rather than the length of the 
codeword). 

For an example of the problem we are addressing refer to
Figure 1. Both codes have minimum cost for the frequencies
$(p_1,p_2,p_3,p_4) = (2,2,1,1)$ but under different letter costs.
The code {00, 01, 10, 11} has minimum cost for the standard
Huffman case of $\Sigma = \{0, 1\}$ in which the length of each letter
is 1, i.e., the cost of a word is the number of bits it has. The
code {aaa, aab, ab, b} has minimum cost for the alphabet
$\Sigma = \{a, b\}$ in which the length of an "a" is 1 and the length
of a "b" is 3.
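
The arithmetic behind this example can be checked with a short sketch. The helper names below (`codeword_cost`, `ordered_code_cost`) are illustrative and not part of the paper; the sketch only evaluates the cost of a given code under the ordered assignment (cheapest codewords to the most frequent words) and does not search for an optimal code.

```python
def codeword_cost(codeword, letter_cost):
    """Sum of the letter costs of one codeword."""
    return sum(letter_cost[ch] for ch in codeword)


def ordered_code_cost(freqs, codewords, letter_cost):
    """Cost of a code when cheaper codewords go to more frequent words."""
    costs = sorted(codeword_cost(x, letter_cost) for x in codewords)
    weights = sorted(freqs, reverse=True)
    return sum(p * c for p, c in zip(weights, costs))


freqs = [2, 2, 1, 1]

# Standard Huffman case: both letters cost 1.
uniform_cost = ordered_code_cost(freqs, ["00", "01", "10", "11"], {"0": 1, "1": 1})

# Unequal letter costs: cost(a) = 1, cost(b) = 3.
unequal_cost = ordered_code_cost(freqs, ["aaa", "aab", "ab", "b"], {"a": 1, "b": 3})

# The balanced tree shape is worse once the letters have unequal costs.
balanced_cost = ordered_code_cost(freqs, ["aa", "ab", "ba", "bb"], {"a": 1, "b": 3})
```

Under these assumptions the balanced code {aa, ab, ba, bb} is strictly more expensive than {aaa, aab, ab, b}, which illustrates why the tree shape has to adapt to the letter costs.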

This generalization is motivated by coding problems in which 
different characters have different transmission times or stor- 
age costs [5; 22; 19; 28; 29]. One example is the telegraph 
channel [10; 11] in which $\Sigma = \{\cdot, -\}$ and $\ell_2 = 2\ell_1$, i.e.,
in which dashes are twice as long as dots. Another is the
(a, b) run-length-limited codes used in magnetic and optical
storage [16; 12], in which the codewords are binary and
constrained so that each 1 must be preceded by at least
a, and at most b, 0's. (This example can be modeled by
the problem studied here by using an encoding alphabet of
$r = b - a + 1$ characters $\{0^k 1 : k = a, a+1, \ldots, b\}$ with
associated costs $\{\ell_i = a + i - 1\}$.)

The literature contains many algorithms for the generalized 
problem. The special case when all the probabilities are 
equal (but not the letter lengths), known as the Varn coding 
problem, is solvable in polynomial-time [29; 1; 7; 25; 13; 6]. 
For the generalized problem, Blachman [5], Marcus [22], and 
(much later) Gilbert [11] give heuristic constructions. Karp 
gave the first algorithm yielding an exact solution (assuming 
the letter costs are integers); Karp's algorithm transforms 
the problem into an integer program and does not run in 
polynomial time [19]. Karp's result was followed by many 
others [21; 9; 8; 23; 3] presenting solutions of cost at most
OPT + $f(\ell_1, \ell_2, \ldots, \ell_r)$, where OPT is the cost of the optimal
code and $f(\ell_1, \ell_2, \ldots, \ell_r)$ is some fixed function of
the edge costs, with the different algorithms having different
$f(\cdot)$ (note that for these results the $p_i$ are considered to
be probabilities that sum to 1). These algorithms essentially
demonstrate that a generalized version of entropy is a lower 
bound on the code cost and then design algorithms that 
construct codes within an additive function of that entropy. 
They do not imply a PTAS for the problem, though,
even with fixed letter costs. Golin and Rote [12] gave a
dynamic programming algorithm that produces exact solutions
in $O(n^{\ell_r + 2})$ time for the special case when the $\ell_i$ are
restricted to be integers. Bradford et al. [24] improved this
to $O(n^{\ell_r})$ when r = 2.

Figure 1: Two minimum-cost codes for the frequencies $(p_1,p_2,p_3,p_4) = (2,2,1,1)$ but under different alphabets.
Codes are represented using the standard tree format with leaves signifying codewords. The left tree is for
$\Sigma = \{0,1\}$ with $\ell_1 = \mathrm{cost}(0) = 1$ and $\ell_2 = \mathrm{cost}(1) = 1$. The right tree is for $\Sigma = \{a, b\}$ with $\ell_1 = \mathrm{cost}(a) = 1$
and $\ell_2 = \mathrm{cost}(b) = 3$. The four codewords encoded by the left tree are {00,01,10,11}; the cost of the code is
$2 \cdot 2 + 2 \cdot 2 + 1 \cdot 2 + 1 \cdot 2 = 12$. The four codewords encoded by the right tree are {aaa, aab, ab, b}; the cost of the
code is $2 \cdot 3 + 2 \cdot 3 + 1 \cdot 4 + 1 \cdot 5 = 21$.

For further references on Huffman coding with unequal let- 
ter costs, see Abrahams' recent survey on source coding [2, 
Section 2.7], which contains a section on the problem. 

Despite the extensive literature, there is no known polynomial- 
time algorithm for the generalized problem, nor is the prob- 
lem known to be NP-hard. Before this work, the problem 
was not known to have any polynomial-time approximation 
algorithm. Our main result here is a polynomial-time ap- 
proximation scheme (PTAS) for the problem: 

Theorem 1. Given an instance $((p_i), (\ell_j))$ of the Huffman
coding problem with unequal letter costs, and given a
positive $\epsilon$, there exists an algorithm that constructs a prefix
code of cost at most $(1 + \epsilon)\mathrm{OPT}$; this algorithm runs in
time $n d \log(n) \exp(O(\ln(1/\epsilon)^2/\epsilon^2))$, where d is the number
of distinct letter costs.

The algorithm is based on a new relaxation of Huffman coding
with unequal letter costs. The relaxation, called the
k-prefix code problem, allows codewords of cost at least k to
be prefixes of other codewords. The algorithm uses grouping
and enumeration techniques to find a near-minimum-cost
k-prefix code (where k is a constant depending only on $\epsilon$),
and then converts this k-prefix code into a true prefix code,
increasing the cost by at most a $1 + O(\epsilon)$ factor.

The techniques introduced in this paper can also be used 
to construct PTAS's for the regular-language prefix-coding 
problem, a different generalization of Huffman coding that 
asks for a minimum-cost prefix code under the additional 
restriction that all codewords belong to a given regular language
$\mathcal{L}$. As one example, the binary codes, constrained
so all codewords must end in a 1, are used for group testing
and the construction of self-synchronizing codes [4; 26].
As another example, binary codes whose codewords contain
at most a specified number of 1's are used for energy
minimization of transmissions in mobile environments [27].
Algorithms (other than exhaustive search) for the regular-language
prefix-coding problem generalize [12] and run in
time $n^{\Theta(S(\mathcal{L}))}$, where $S(\mathcal{L})$ is the number of states in the
smallest deterministic finite automaton that accepts $\mathcal{L}$. In
this extended abstract we do not discuss how the techniques 
here extend to that problem. 

Alphabetic coding is like the generalized problem considered 
here, but with an additional constraint on the code: the 
codewords must be chosen in increasing alphabetic order 
(with respect to the words to be encoded). This problem 
arises in designing testing procedures in which the time re- 
quired by a test depends upon the outcome of the test [20, 
6.2.2, ex. 33] and has also been studied under the names 
dichotomous search [14] or the leaky shower problem [18]. 
Alphabetic coding has a polynomial-time algorithm [17]. 

2. NOTATIONS AND DEFINITIONS 

A problem instance is specified by a set W of n words with
associated frequencies $p_1 \ge p_2 \ge \cdots \ge p_n > 0$, an alphabet
$\Sigma$ of $r \ge 2$ letters with associated costs $\ell_1 \le \ell_2 \le \cdots \le \ell_r$,
and an $\epsilon > 0$. The Huffman coding problem with unequal letter
costs is the problem of finding a prefix code of minimum
cost.



Definitions 1. A code (usually denoted c) is an injective
map from W to $\Sigma^*$. Its image c(W) is called the set of
codewords of c. A set $S \subseteq \Sigma^*$ is prefix-free if no element
of S is a prefix of any other element of S; a prefix code
is one whose set of codewords is prefix-free. The cost of a
code c is $\sum_{i=1}^{n} p_i \,\mathrm{cost}(c(w_i))$, where cost(x) is the sum of the
costs of the letters of x.

Unless otherwise stated, all the codes considered here are
ordered, i.e., the map c assigns codewords of smaller costs
to words of larger probability:
$\mathrm{cost}(c(w_1)) \le \mathrm{cost}(c(w_2)) \le \cdots \le \mathrm{cost}(c(w_n))$.
Clearly, any optimal code must be ordered.



Definition 2. A k-prefix code is a code in which no
codeword of cost less than k is a prefix of any other codeword.

One of the main reasons that prefix codes are useful is that
they are uniquely decipherable. A k-prefix code is in general
not uniquely decipherable and therefore not particularly
useful by itself. The problem of designing a k-prefix code is
introduced and used here solely as an intermediate tool for
solving the original problem.
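
To make the difference between the two notions concrete, here is a minimal sketch of both checks. The function names and the sample code are illustrative assumptions, not taken from the paper; codewords are plain Python strings and letter costs are given as a dictionary.

```python
def cost(x, letter_cost):
    """Sum of the letter costs of the string x."""
    return sum(letter_cost[ch] for ch in x)


def is_prefix_free(codewords):
    """True if no codeword is a prefix of another (codewords assumed distinct)."""
    words = sorted(codewords)
    # In lexicographic order, all strings extending x sit directly after x,
    # so it is enough to compare neighbours.
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


def is_k_prefix(codewords, letter_cost, k):
    """True if no codeword of cost less than k is a prefix of another codeword."""
    cheap = [x for x in codewords if cost(x, letter_cost) < k]
    return not any(y != x and y.startswith(x) for x in cheap for y in codewords)


# Hypothetical example: cost(a) = 1, cost(b) = 2.
letter_cost = {"a": 1, "b": 2}
sample = ["ab", "aaaa", "aaaab", "bb", "bba"]
```

With k = 4 the sample is a k-prefix code even though it is not prefix-free: the only prefix relations ("aaaa" before "aaaab", "bb" before "bba") involve codewords of cost at least 4.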

We generally use w to denote a word to be encoded, x to
denote a potential codeword (a string in $\Sigma^*$), cost(x) to
denote the sum of the costs of the letters in x, and |x| (the
size of x) to denote the number of letters. To distinguish
the given words W from the potential codewords $\Sigma^*$, we
call the former words and the latter strings. We use d to
denote the number of distinct letter costs.

The following assumption about the input is convenient: 

Assumption 1. Each $\ell_i$ for $i \ge 2$ is an integer multiple
of $\epsilon$, $\ell_2$ equals 1, and $\epsilon$ is either an integer multiple of $\ell_1$
or evenly divides $\ell_1$. Thus, all codeword costs are integer
multiples of $\min\{\ell_1, \epsilon\}$.

It is also without loss of generality: 

Lemma 1. The following reduction reduces any problem
instance to an instance satisfying Assumption 1.

1. Scale the $\ell_i$'s uniformly to make $\ell_2 = 1$.

2. Decrease $\epsilon$ to its next smallest multiple, or divisor, of $\ell_1$.

3. Increase each $\ell_i$ for $i \ge 2$ to its next largest multiple of $\epsilon$.

4. Scale the $\ell_i$'s, and $\epsilon$, uniformly to make $\ell_2 = 1$.

Proof. Step 2 decreases $\epsilon$ by at most a factor of 2. Step
3 increases the cost of any solution by at most a $1 + \epsilon$ factor.
Step 4 decreases $\epsilon$ by at most a factor of $1 + \epsilon$. □
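
The four steps of the reduction can be sketched with exact rational arithmetic. This is only an illustration under stated assumptions: the function name `normalize` is hypothetical, the input must contain at least two letter costs, and `fractions.Fraction` is used so that the "multiple of" and "divides" relations can be tested exactly.

```python
from fractions import Fraction
from math import ceil, floor


def normalize(letter_costs, eps):
    """Return (costs, eps) adjusted as in the four steps of the reduction."""
    costs = sorted(Fraction(c) for c in letter_costs)
    eps = Fraction(eps)

    # Step 1: scale so that the second-smallest cost equals 1.
    scale = costs[1]
    costs = [c / scale for c in costs]
    l1 = costs[0]

    # Step 2: decrease eps to a multiple of l1, or to a divisor of l1.
    if eps >= l1:
        eps = l1 * floor(eps / l1)
    else:
        eps = l1 / ceil(l1 / eps)

    # Step 3: round every cost except l1 up to a multiple of eps.
    costs = [l1] + [eps * ceil(c / eps) for c in costs[1:]]

    # Step 4: rescale costs and eps together so that the second cost is 1 again.
    scale = costs[1]
    return [c / scale for c in costs], eps / scale


costs, eps = normalize([2, 3, 5], Fraction(1, 4))
```

For the hypothetical input costs (2, 3, 5) with $\epsilon = 1/4$, the sketch returns costs (3/5, 1, 8/5) and $\epsilon = 1/5$: every cost from the second onward is a multiple of the new $\epsilon$, and the new $\epsilon$ evenly divides the smallest cost.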

In the main sections (particularly Lemmas 7 and 8) we also
assume that $\ell_1 \ge \epsilon/n$. The special case $\ell_1 < \epsilon/n$ is easy and
is dealt with in Section 7.

Before we explain the main algorithm (Algorithm 3), we first
explain two subroutines used in that algorithm: Algorithm 1
for constructing a so-called leveled k-prefix code meeting certain
constraints, and Algorithm 2 for converting such a code
into a true prefix code.

3. FINDING AN OPTIMAL k-PREFIX CODE

Algorithm 1 finds an optimal k-prefix code meeting some
given constraints. The first constraint is that the code should
be leveled:



Algorithm 1 - builds a leveled k-prefix-free set of codewords
given the level-0 codeword and the number of codewords
per level.

INPUT: Letter costs, directed graph D, constraints
$(f(0), f(1), \ldots, f((k-1)/\epsilon))$.

OUTPUT: leveled k-prefix code with f(i) codewords in
each level $i \ge 1$ and a codeword in level zero of size f(0)
(if f(0) > 0). (Or "inconsistent" if no such code exists.)

1: $S \leftarrow \emptyset$

2: For any node $\ell$ of D, define $v_S(\ell)$ to be the number of
   strings of cost $\ell$ having no prefix in S. (The algorithm
   will build a table v with $v[\ell] = v_S(\ell)$ as it proceeds.)

3: If f(0) > 0, then $S \leftarrow \{a^{f(0)}\}$ (a is the smallest letter).

4: For each node $\ell$ on level 0, initialize $v[\ell]$:
   $v[\ell] = 1$ if $\ell < f(0)$ or $f(0) = 0$; $v[\ell] = 0$ otherwise.

5: for $i = 1, 2, \ldots, (k-1)/\epsilon$ do

6:    For each node $\ell$ of D in level i, by order of increasing
      costs, initialize $v[\ell]$ using the recurrence
      $v_S(\ell) = \sum_{j} v_S(\ell - \ell_j)$.

7:    Let $\ell = 1 + i\epsilon - \min\{\ell_1, \epsilon\}$. If $v[\ell] < f(i)$, then return
      "inconsistent". Otherwise, choose f(i) codewords of
      cost $\ell$ and add them to S. Decrement $v[\ell]$ by f(i).

8: Complete S by adding $n - |S|$ strings of minimum cost
   among the strings of cost $\ge k$ that don't have a prefix
   of cost $< k$ in S. If $|S| < n$ and there aren't any such
   strings, return "inconsistent". (Find the $n - |S|$ strings
   by extending D as needed beyond cost k.)

9: Return the set of codewords S.



Definitions 3. For $i \in \{1, \ldots, (k-1)/\epsilon\}$, define the ith
level of $\Sigma^*$ to be the set of strings x such that
$1 + (i-1)\epsilon \le \mathrm{cost}(x) < 1 + i\epsilon$. Define level 0 to be the set of
strings that cost less than 1.

A code is maximal within level i if every codeword in
level i is of cost $1 + i\epsilon - \min\{\ell_1, \epsilon\}$.

A code is leveled if it is maximal within all levels.

Note that level 0 can only contain words of $a^*$, where a is
the letter of cost $\ell_1$. Note also that by assumption either (i)
$\ell_1$ is an integer multiple of $\epsilon$, in which case each level $i > 0$
consists of the strings of cost $1 + (i-1)\epsilon$ and every code is
trivially maximal, or (ii) $\ell_1$ evenly divides $\epsilon$, in which case
every string in level i has cost $1 + (i-1)\epsilon + j\ell_1$ for some
integer $j < n$.

The second constraint that the constructed code must
meet is specified by a tuple $f = (f(0), \ldots, f((k-1)/\epsilon))$ of
integers, with the following meaning.

1. If f(0) = 0, then there must be no codeword in level
0; if f(0) > 0, then the set of codewords must contain
the codeword $a^{f(0)}$ from level 0.

2. For every level $i \ge 1$, the set of codewords must contain
exactly f(i) codewords in level i.

Algorithm 2 - converts a k-prefix code into a prefix code

INPUT: letter costs; k-prefix code c
OUTPUT: prefix code c'

1: Let a and b be the letters of cost $\ell_1$ and $\ell_2$, respectively.

2: For positive integer i, define $\mathrm{enc}(i) = b'_1 b'_1 b'_2 b'_2 \cdots b'_j b'_j\, a\, b$,
   where $b'_1 \cdots b'_j$ is obtained from the binary representation
   $b_1 b_2 \cdots b_j$ of i by replacing each "0" with "a" and
   each "1" with "b". Define enc(0) = ab.

3: for each codeword of cost $\ge k$ do

4:    Let $\alpha$ be the smallest prefix of cost $\ge k$ and $\beta$ the
      remaining suffix. Replace the codeword by $\alpha\,\mathrm{enc}(i)\,\beta\, b$,
      where i is the number of b's in $\beta$.

5: Return the modified code.



Algorithm 1 uses a directed graph D that is computed by
Algorithm 3 and given to Algorithm 1 as input. The nodes
of D are all possible codeword costs in [0, k], with an arc
from $\ell$ to $\ell'$ if and only if $\ell' - \ell \in \{\ell_1, \ell_2, \ldots, \ell_r\}$.

Here is the performance guarantee of Algorithm 1: 

Lemma 2. Given a constraint $f = (f(0), \ldots, f((k-1)/\epsilon))$,
if there exists a leveled k-prefix code consistent with the
constraint, then Algorithm 1 constructs one of minimum cost.

Proof. It is straightforward to verify that Algorithm 1
correctly computes each $v_S(\ell)$ and produces a leveled
k-prefix code consistent with the constraint, if one exists.

Among the codes meeting the constraints of being leveled
and consistent with f, the chosen code has minimum cost
because the cost of codewords on levels $0, 1, \ldots, (k-1)/\epsilon$ is
determined by those constraints, and, given the codewords
in those levels, the chosen code takes a set of codewords of
cost $\ge k$ of minimum possible cost. □
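
The counting recurrence at the heart of Algorithm 1 can be illustrated in a simplified setting. The sketch below is not the paper's algorithm: it assumes integer letter costs, ignores the level structure and the special handling of level 0, and takes a hypothetical dictionary `take` that says how many codewords to choose at each cost. It maintains a table v in which v[c] counts the strings of cost c that have no prefix among the codewords chosen so far.

```python
def build_counts(letter_costs, max_cost, take):
    """Fill the table v for costs 0..max_cost; return None if inconsistent.

    letter_costs: list of positive integer letter costs (one entry per letter).
    take: dict mapping a cost c to the number of codewords to choose at cost c.
    """
    v = [0] * (max_cost + 1)
    v[0] = 1  # the empty string
    for c in range(1, max_cost + 1):
        # A free string of cost c is a free string of cost c - l plus one letter.
        v[c] = sum(v[c - l] for l in letter_costs if l <= c)
        wanted = take.get(c, 0)
        if v[c] < wanted:
            return None  # not enough free strings at this cost
        v[c] -= wanted  # chosen codewords can no longer be extended
    return v


# Figure 1, right tree: cost(a) = 1, cost(b) = 3, codewords of cost 3, 3, 4, 5.
figure1 = build_counts([1, 3], 5, {3: 2, 4: 1, 5: 1})
```

With these inputs the table confirms that two codewords of cost 3, one of cost 4, and one of cost 5 can coexist, while asking for three codewords of cost 3 is reported as inconsistent.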

4. CONVERTING A k-PREFIX CODE INTO
A PREFIX CODE

Algorithm 2 converts a k-prefix code into a prefix code. The
next lemma captures what we need to know about Algorithm 2
in order to use it in the main algorithm (Algorithm 3):

Lemma 3. Given any k-prefix code c of cost $\alpha$, Algorithm 2
constructs a prefix code c' of cost at most
$\alpha\,[1 + \ell_2 (5 + 2\log_2 k)/k]$.

Proof. First we analyze the cost. Let $c(w) = \alpha\beta$ and
$c'(w) = \alpha\,\mathrm{enc}(i)\,\beta\, b$, respectively, be an original and modified
codeword in Algorithm 2. From $i \le \mathrm{cost}(\beta)$ it follows that
$\mathrm{cost}(c'(w)) \le \mathrm{cost}(c(w)) + \ell_2 [5 + 2\log_2 \mathrm{cost}(c(w))]$. Since
$\mathrm{cost}(c(w)) \ge k$ if the codeword is modified, each modified
codeword costs at most $1 + \ell_2 (5 + 2\log_2 k)/k$ times as much
as the original.

Next we show that c' is prefix free. Suppose c'(v) is a prefix
of c'(w) for some $v, w \in W$. Since the original code was
k-prefix free, it must be that

$c'(v) = \alpha\,\mathrm{enc}(i)\,\beta\, b$
and $c'(w) = \gamma\,\mathrm{enc}(j)\,\delta\, b$

where $\alpha$ and $\gamma$ each have cost $\ge k$ but have no proper prefix
of cost $\ge k$, and where i and j are the number of b's in
$\beta$ and $\delta$, respectively (as in Algorithm 2). Since c'(v) is a
prefix of c'(w), $\alpha$ is a prefix of c'(w), which means $\alpha = \gamma$.
Thus, enc(i) is a prefix of $\mathrm{enc}(j)\,\delta\, b$. Since every letter in
enc() is doubled except the last two, it must be that i = j.
Thus, $\beta b$ is a prefix of $\delta b$. But (since i = j) $\beta$ has the same
number of b's as $\delta$, so it must be that $\beta = \delta$. Finally, we
can conclude that $\alpha\beta = \gamma\delta$. Since these were the original
codewords assigned to v and w, it must be that v = w.

□
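
A small sketch of the conversion in Algorithm 2 may help to see why the result is prefix-free. It is an illustration under assumptions: codewords are Python strings, the two cheapest letters are literally the characters "a" and "b", k is positive, and the sample code and costs are hypothetical.

```python
def enc(i):
    """Self-delimiting encoding of i: doubled binary digits, then 'ab'."""
    if i == 0:
        return "ab"
    bits = bin(i)[2:]
    return "".join(("a" if bit == "0" else "b") * 2 for bit in bits) + "ab"


def cost(x, letter_cost):
    return sum(letter_cost[ch] for ch in x)


def to_prefix_code(codewords, letter_cost, k):
    """Rewrite every codeword of cost >= k as alpha + enc(i) + beta + 'b'."""
    result = []
    for x in codewords:
        if cost(x, letter_cost) < k:
            result.append(x)
            continue
        # alpha is the shortest prefix of x whose cost reaches k.
        total, cut = 0, 0
        while total < k:
            total += letter_cost[x[cut]]
            cut += 1
        alpha, beta = x[:cut], x[cut:]
        result.append(alpha + enc(beta.count("b")) + beta + "b")
    return result


def is_prefix_free(codewords):
    words = sorted(codewords)
    return all(not nxt.startswith(cur) for cur, nxt in zip(words, words[1:]))


# Hypothetical k-prefix code with cost(a) = 1, cost(b) = 2 and k = 4.
original = ["ab", "aaaa", "aaaab", "bb", "bba"]
converted = to_prefix_code(original, {"a": 1, "b": 2}, 4)
```

In this example "aaaa" is a prefix of "aaaab" before the conversion; afterwards the inserted enc() blocks ("ab" for zero b's, "bbab" for one b) separate the two codewords.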

5. THE MAIN ALGORITHM 

The main algorithm is Algorithm 3. It consists of: 

1. Some preprocessing (Steps 1, 2, 3, and 4). 

2. Calling Algorithm 1 for $O_\epsilon(1)$ different constraint
tuples $(f(0), \ldots, f((k-1)/\epsilon))$ (Steps 5, 6, 7).

3. Choosing the best k-prefix code among those output
by Algorithm 1, transforming it into a prefix code
(Step 9), and returning the resulting code.

Next we analyze the cost of the code produced. 

Lemma 4. Step 4 partitions W into $O(k/\epsilon^2)$ groups.

Proof. Take any two consecutive groups other than $G_1$.
The cumulative probability of the words in the two groups is
at least $(1 - p_1)\epsilon^2/(2k)$. Thus there can be at most $1 + 4k/\epsilon^2$
such groups. □
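
The greedy partition of Step 4 of Algorithm 3 can be sketched as follows. The function name `greedy_groups` is hypothetical, the probabilities are assumed to be sorted in non-increasing order and to sum to 1, and the threshold for singleton groups is read here as "at least $(1 - p_1)\epsilon^2/(2k)$", which is an interpretation of the extracted text. Groups are returned as lists of 0-based indices.

```python
from fractions import Fraction


def greedy_groups(p, eps, k):
    """Partition indices 0..n-1 into consecutive groups as in Step 4."""
    bound = (1 - p[0]) * eps**2 / k
    groups = [[0]]  # the first group holds only the most probable word
    j = 1
    # Words with probability at least bound / 2 form singleton groups.
    while j < len(p) and p[j] >= bound / 2:
        groups.append([j])
        j += 1
    # Remaining words are packed greedily while the group total stays <= bound.
    while j < len(p):
        group, total = [j], p[j]
        j += 1
        while j < len(p) and total + p[j] <= bound:
            group.append(j)
            total += p[j]
            j += 1
        groups.append(group)
    return groups


probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8),
         Fraction(1, 16), Fraction(1, 32), Fraction(1, 32)]
groups = greedy_groups(probs, Fraction(1), 2)
```

The parameters $\epsilon = 1$ and k = 2 are chosen only to keep the numbers small; the number of groups stays within the $1 + 4k/\epsilon^2$ bound from the proof of Lemma 4.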

Lemma 5. $\mathrm{OPT} \ge 1 - p_1$.

Proof. At most one codeword can belong to $a^*$. All the
other codewords contain at least one letter which costs at
least 1, and their cumulative frequency is at least
$1 - p_{\max} = 1 - p_1$. □

We now focus on Steps 5 through 8. The analysis of Algorithm 1
implies that these find a code that is optimal among
leveled k-prefix codes meeting the grouping constraints (mapping
elements within each group to the same level). The
next lemma implies that this code has cost at most $1 + O(\epsilon)$
times the minimum cost of any k-prefix code.

Lemma 6. For any k-prefix code c, there exists a leveled
k-prefix code c' that maps group elements within each group
to the same level, and such that $\mathrm{cost}(c') \le \mathrm{cost}(c)(1 + O(\epsilon))$.

Algorithm 3 - finds a $(1 + O(\epsilon))$-optimal prefix code

INPUT: Set W of words with frequencies $p_1 \ge p_2 \ge \cdots \ge p_n > 0$;
letter costs $\ell_1 \le \ell_2 \le \cdots \le \ell_r$; $\epsilon > 0$.
OUTPUT: prefix code of cost at most $(1 + O(\epsilon))\mathrm{OPT}$

1: Adjust $\epsilon$ and the $\ell_i$'s so that $\ell_2 = 1$, $\epsilon$ is an integer divisor
   or multiple of $\ell_1$, and each $\ell_i$ for $i \ge 2$ is a multiple of $\epsilon$
   (as in Lemma 1).

2: Choose $k = \Theta(\log(1/\epsilon)/\epsilon)$ so $k - 1$ is a multiple of $\epsilon$.

3: Construct a directed graph D whose nodes are all codeword
   costs in [0, k], with an arc from $\ell$ to $\ell'$ iff
   $\ell' - \ell \in \{\ell_1, \ell_2, \ldots, \ell_r\}$. Do this by enumerating the vertices and
   edges of D using breadth-first search from the root (node 0).

4: Greedily partition W into groups.

   • The first group is $G_1 = \{w_1\}$.

   • Take i maximum such that $p_i \ge (1 - p_1)\epsilon^2/(2k)$.
     Then $G_2 = \{w_2\}$, $G_3 = \{w_3\}$, ..., and $G_i = \{w_i\}$.

   • While W is not empty, greedily take for the next
     group $\{w_j, w_{j+1}, \ldots, w_e\}$, where
     $p_j + \cdots + p_e \le (1 - p_1)\epsilon^2/k < p_j + \cdots + p_{e+1}$.

5: By exhaustive search, guess whether there is a codeword
   of cost < 1 and what its size is (call it f(0)), and, for
   each $i = 1, 2, \ldots, (k-1)/\epsilon$, guess which groups will be
   assigned to level i (i.e., with costs in $[1 + \epsilon(i-1), 1 + \epsilon i)$)
   and let f(i) denote the total number of words in those
   groups.

6: for each such guess $(f(0), \ldots, f((k-1)/\epsilon))$ do

7:    Use Algorithm 1 to construct an optimal k-prefix code
      consistent with the guess (if one exists).

8: Let c be the minimum-cost code constructed.

9: Using Algorithm 2, convert c into a prefix code c'.

10: Return c'.



Proof. Let c be an optimal k-prefix code. We modify c
level by level so that for each i, c is maximal within level
i and so that level i contains all or none of the elements of
each group, as follows. Since level 0 contains at most one
codeword, and the first group contains exactly one element,
this is already true for level 0. Assume that we have already
modified c to guarantee those properties for levels up to $i - 1$,
and consider level i.

Leveling phase for level i. First modify c so that it is
maximal within level i by taking every codeword z in level
i and padding it with enough a's (i.e., replace z by $z a^j$) so
that its cost is $1 + i\epsilon - \min\{\ell_1, \epsilon\}$, i.e., so that adding one
more letter a would move the codeword out of level i. (Here
we use the assumption that every letter cost is a multiple
of $\epsilon$, with the possible exception of $\ell_1$ which then evenly
divides $\epsilon$.)

Grouping phase for level i. Next modify the code so that
level i contains all or none of the elements of each group (i.e.,
no group is "split"). By induction, no group is split in levels
$0, \ldots, i - 1$. Since the code is ordered, at most one group
can be split from level i. Take each codeword in that group
and on level i and pad it with an a to move it out of level i.
Then, if necessary, reorder the code (on levels greater than
i).

When this construction has been done for every i, the original
k-prefix code c has been changed into a leveled k-prefix
code c' that maps elements within each group to the same
level. Next we show that the cost of c' is $1 + O(\epsilon)$ times the
cost of c.

First consider a codeword that is padded in the leveling
phase for some level i but not in the grouping phase for that
level. Since each codeword has this happen at most once,
and the operation increases the cost of the codeword by at
most $\epsilon$ times the original cost of the codeword, the total
increase in cost due to such operations is at most $\epsilon$ times
the original cost of the code.

Next consider codewords that are padded in the grouping
phases. In the leveling and grouping phases for a given level
i, the cost of a codeword is increased by at most $\epsilon + \ell_1 \le 2$.
Since only one group is padded in the grouping phase, and
(by the grouping criterion) the group has total probability at
most $(1 - p_1)\epsilon^2/k$, the total increase in cost of all codewords
in the group during the leveling and grouping phase for level
i is at most $2(1 - p_1)\epsilon^2/k$. Since there are $O(k/\epsilon)$ levels, the
total increase in cost due to these operations for all levels is
$O((1 - p_1)\epsilon)$. By Lemma 5, this is $O(\epsilon)$ times the original
cost of the code.



□ 

Together with Lemma 2, Lemma 6 implies that at the end
of Step 8, Algorithm 3 finds a k-prefix code that has cost at
most $(1 + O(\epsilon))\mathrm{OPT}$. Lemma 3 implies that, for k chosen
as in Step 2, the main algorithm converts the k-prefix code
into a prefix code that also has cost at most $(1 + O(\epsilon))\mathrm{OPT}$.
This concludes the proof of the performance guarantee in
Theorem 1 for the case $\ell_1 \ge \epsilon/n$.

6. ANALYSIS OF RUNNING TIME 

We analyze the directed graph D which was built in Step 3 
of Algorithm 3 and used by Algorithm 1 . 

Lemma 7. Graph D has at most $nk/\epsilon$ nodes and $dnk/\epsilon$
edges, where d is the number of distinct letter costs among
$\{\ell_1, \ldots, \ell_r\}$.

Proof. From Step 1 of Algorithm 3 all letter costs other
than $\ell_1$ are multiples of $\epsilon$, and $\ell_1$ is either a multiple of $\epsilon$ or
evenly divides $\epsilon$. Moreover, we've assumed $\ell_1 \ge \epsilon/n$. Thus
each of the $k/\epsilon$ levels has at most n distinct costs. Thus D
has at most $nk/\epsilon$ nodes; each node has outdegree at most
d, the number of distinct letter costs. □

In particular, in Step 3 of Algorithm 3 graph D can be
constructed in time $O(ndk/\epsilon)$. Using an array with a bucket
for each possible codeword cost, we can access the nodes of
D in constant time.
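
A breadth-first construction of D along these lines might look as follows. This is a sketch under assumptions: the function name `build_cost_graph` is hypothetical, costs are integers or exact rationals so that equal sums compare equal, a Python set stands in for the bucket array, and only distinct letter costs generate arcs (arc multiplicities for letters of equal cost are not recorded).

```python
from collections import deque


def build_cost_graph(letter_costs, k):
    """Return (sorted node list, arc list) of the cost graph on [0, k]."""
    distinct = sorted(set(letter_costs))
    nodes = {0}
    arcs = []
    queue = deque([0])
    while queue:
        cost = queue.popleft()
        for step in distinct:
            nxt = cost + step
            if nxt > k:
                continue
            arcs.append((cost, nxt))  # at most d outgoing arcs per node
            if nxt not in nodes:
                nodes.add(nxt)
                queue.append(nxt)
    return sorted(nodes), arcs


nodes, arcs = build_cost_graph([1, 3], 5)
```

Each node is dequeued once and emits at most d arcs, which matches the edge bound of d times the number of nodes. Costs that no string can realize, such as cost 1 when the letter costs are 2 and 3, never become nodes.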

Lemma 8. Algorithm 1 can be implemented to run in time
$O(ndk/\epsilon)$.

Proof. Since $\ell_1 \ge \epsilon/n$, there are $O(n/\epsilon)$ nodes on level
0, so Step 3 of Algorithm 1 takes time $O(n/\epsilon)$.

The loop of Step 5 is iterated $k/\epsilon$ times. The total time for
Step 6 is at most O(1) per edge, therefore at most $dnk/\epsilon$
total.

Total time for Step 7 is O(n) (O(1) per codeword). (The
exact cost of each iteration depends on details of implementation
of the codeword set S. We use an implicit representation
of S by a tree, and we highlight an edge e of D and
give it a "codeword weight" w(e) if it is used to build w(e)
codewords of S. This data structure can be used to obtain
the desired time bounds.)

Finally, Step 8 is implemented by doing a breadth-first search
to extend D beyond cost k, starting from all the nodes of
D which have $v_S(\ell) \ge 1$. Since the outdegree of any node
is bounded by d, and we stop as soon as we have identified
at most n shortest eligible strings of cost $\ge k$, this has
complexity O(nd).

Overall, the algorithm thus has running time $O(ndk/\epsilon)$. □

Remark: Constructing an explicit representation of S would
require, in general, complexity $\Omega(n^2)$, since the optimal codewords
can have total length $\Omega(n^2)$ (for example, if
$S = \{b, ab, a^2 b, \ldots, a^n b\}$, as might be the case when $\ell_1 \ll 1$).

Lemma 9. Algorithm 2 can be implemented in time O(n).

Proof (sketch). First represent S with a tree as in the
usual Huffman encoding problem, compacting the representation
as you go so that every internal node has at least two
children, and so the total number of nodes is O(n). Then
calculate the number of b's recursively in time O(n). Finally,
modify the tree to insert the extra letters. This can
be implemented in time O(n). □

We now go back to Algorithm 3. It is easy to see that
everything outside Step 6 takes time O(nd). From Lemma 4,
there are only $O(k/\epsilon^2)$ groups, and $O(k/\epsilon)$ levels. Since
the assignment of groups to levels respects the ordering of
both, the map is determined by knowing, for each level, the
maximum index of any group assigned to it. Thus there
are $\exp(O(\ln(k/\epsilon)\,k/\epsilon))$ maps from groups to levels, hence
$\exp(O(\ln(k/\epsilon)\,k/\epsilon))$ choices for $(f(1), \ldots, f(k/\epsilon))$.

How about f(0): how many possibilities must we try? As
written, the algorithm tries every possibility in level 0, which
could mean as many as O(n). However, it is easy to modify
the algorithm so that we only try $O(\log(n)/\epsilon)$ possibilities
for f(0): indeed, rounding costs up to the nearest power of
$(1 + \epsilon)$, it is enough to try
$f(0) = 0, 1, \ldots, 1/\epsilon, (1/\epsilon)(1 + \epsilon), (1/\epsilon)(1 + \epsilon)^2, \ldots$

Since each iteration takes time $O(ndk/\epsilon)$, and there are
$\exp[O(\ln(k/\epsilon)\,k/\epsilon)] \log(n)/\epsilon$ iterations, we obtain a complexity
of $(n d \log(n)/\epsilon^2) \exp(O(\ln(k/\epsilon)\,k/\epsilon))$. Expanding k and
simplifying, this is $n d \log(n) \exp(O(\ln(1/\epsilon)^2/\epsilon^2))$. This
completes the proof of Theorem 1 for the case $\ell_1 \ge \epsilon/n$.
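
The simplification in the last step can be spelled out as follows. This is a sketch of the arithmetic, assuming $k = \Theta(\ln(1/\epsilon)/\epsilon)$ as in Step 2 of Algorithm 3 and $\epsilon$ bounded away from 1 (say $\epsilon \le 1/2$), so that polynomial factors in $1/\epsilon$ are absorbed by the exponential.

```latex
\begin{align*}
k = \Theta\!\left(\frac{\ln(1/\epsilon)}{\epsilon}\right)
  &\;\Longrightarrow\;
  \frac{k}{\epsilon} = \Theta\!\left(\frac{\ln(1/\epsilon)}{\epsilon^{2}}\right),
  \qquad
  \ln\frac{k}{\epsilon} = O\!\left(\ln\frac{1}{\epsilon}\right), \\
\ln\!\left(\frac{k}{\epsilon}\right)\frac{k}{\epsilon}
  &= O\!\left(\frac{\ln(1/\epsilon)^{2}}{\epsilon^{2}}\right), \\
\frac{n d \log(n)}{\epsilon^{2}}\,
  \exp\!\left(O\!\left(\ln\!\left(\tfrac{k}{\epsilon}\right)\tfrac{k}{\epsilon}\right)\right)
  &= n d \log(n)\,
  \exp\!\left(2\ln\tfrac{1}{\epsilon}
      + O\!\left(\frac{\ln(1/\epsilon)^{2}}{\epsilon^{2}}\right)\right)
   = n d \log(n)\,
  \exp\!\left(O\!\left(\frac{\ln(1/\epsilon)^{2}}{\epsilon^{2}}\right)\right).
\end{align*}
```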

7. DEALING WITH THE CASE $\ell_1 < \epsilon/n$



Recall that a is the shortest letter and b the second shortest
letter, of cost $\ell_2 = 1$.

Lemma 10. If $\ell_1 < \epsilon/n$, then there exists a prefix code
c with the following properties. The cost of c is at most
$1 + \epsilon$ times the minimum cost of any prefix code. There is
a unique codeword of the form $a^{i_0}$, where $i_0$ is of the form
$\lfloor (1 + \epsilon)^j \rfloor$ with $j = O(\log(n)/\epsilon)$. Every other codeword has
one of the following forms:

1. $b a^j b$ for some $j < n$,

2. $a^j x a^n$ where $x \in \Sigma - a$ and $j < i_0$.

Proof. Given a minimum-cost prefix code, let $i_0$ be such
that $a^{i_0}$ is the unique codeword in $a^*$. Modify the code as
follows.

1. Replace codeword $a^{i_0}$ by $a^{i_1}$, where $i_1$ is the next largest
integer of the form $\lfloor (1 + \epsilon)^j \rfloor$.

2. For each i ($1 \le i \le n$), if the ith codeword has two or
more occurrences of letters other than a, replace the codeword
by $b a^i b$.

3. Any codeword not modified in Step 1 or 2 must be of the
form $a^j x a^*$ where $x \in \Sigma - a$ and $0 \le j < i_0$. (There can be
at most one such starting with $a^j x$.) Replace it by $a^j x a^n$.

This gives a prefix code of the desired form. It is easy to
verify that the cost of each codeword is increased by at most
a $1 + \epsilon$ factor. □

The algorithm will first find the minimum-cost code among
those of the form described in Lemma 10 as follows: for each
of the $O(\log(n)/\epsilon)$ choices for $i_0$, consider the ordered code
whose codewords are the n least costly strings in the set

$\{a^{i_0}\} \cup \{b a^j b : j < n\} \cup \{a^j x a^n : j < i_0\}$.

Among the codes considered, take the one of minimum cost.
By the lemma, the cost of this code will be at most $1 + \epsilon$
times the minimum cost of any prefix code.

Under Assumption 1, each iteration can be implemented in
$O(n + d) = O(n)$ time, so for this special case the algorithm
takes $O(n \log(n)/\epsilon)$ time.